t-distributed stochastic neighbor embedding (tSNE) is a nonlinear, nonparametric, unsupervised machine learning algorithm for dimension reduction. It is used to find patterns in high-dimensional data.
Recall that linear dimension reduction techniques such as PCA project high-dimensional data into a reduced feature space, such as 2 or 3 main axes of “distilled” variation that can be efficiently visualized.
tSNE visualizations often look a little nicer than those for PCA because instead of preserving distances between observations, tSNE converts those distances into probabilities of points being neighbors and minimizes the Kullback-Leibler divergence (its loss function) between the high- and low-dimensional probability distributions. Keep in mind that it is difficult to say what the separation in a tSNE plot looks like in the original high-dimensional space, because extrapolating a low-dimensional representation back into higher dimensions can be dubious.
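As a sketch of the objective being minimized (notation follows van der Maaten and Hinton's original formulation): pairwise distances become neighbor probabilities, and the embedding is chosen so that the two distributions agree:

```latex
% Similarities in the original space (Gaussian kernel; \sigma_i is set by the perplexity):
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarities in the low-dimensional embedding (Student-t kernel, one degree of freedom):
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The loss minimized by gradient descent:
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```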
Run these lines manually if you need to install or update the following packages:
if (FALSE) {
install.packages(c(
# train/test data splitting
"caret",
# Our sole ML algorithm this time around
"randomForest",
# tSNE algorithms
"Rtsne", "tsne"
))
}
Load the required packages:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(Rtsne)
library(tsne)
iris dataset
data(iris)
# Learn about the data
?iris
# View its structure
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# How many of each species?
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
We will fit one model using the tsne package and one using the Rtsne package. Then, we will use the Rtsne model to add coordinates to our dataset and to train and evaluate a random forest algorithm on these new data.
tsne package
Here, the help files outline a concise way to fit the tSNE algorithm via a brief plotting function:
# Define colors for plotting
colors = rainbow(length(unique(iris$Species)))
# Assign one color to each species
names(colors) = unique(iris$Species)
colors
## setosa versicolor virginica
## "#FF0000FF" "#00FF00FF" "#0000FFFF"
# Define an epoch callback: tsne() calls this during optimization
# so we can watch the embedding evolve
ecb = function(x, y) {
  plot(x, t = 'n')
  text(x, labels = iris$Species, col = colors[iris$Species])
}
# Fit
set.seed(1)
system.time({
tsne_iris = tsne::tsne(iris[, -5], epoch_callback = ecb, perplexity = 50)
})
## sigma summary: Min. : 0.565012665854053 |1st Qu. : 0.681985646004023 |Median : 0.713004330336136 |Mean : 0.716213420895748 |3rd Qu. : 0.74581655363904 |Max. : 0.874979764925049 |
## Epoch: Iteration #100 error is: 12.5419603996613
## Epoch: Iteration #200 error is: 0.255642913624415
## Epoch: Iteration #300 error is: 0.243735702264651
## Epoch: Iteration #400 error is: 0.24370348684716
## Epoch: Iteration #500 error is: 0.243703479565549
## Epoch: Iteration #600 error is: 0.243703479562828
## Epoch: Iteration #700 error is: 0.243703479562827
## Epoch: Iteration #800 error is: 0.243703479562828
## Epoch: Iteration #900 error is: 0.243703479562827
## Epoch: Iteration #1000 error is: 0.243703479562827
## user system elapsed
## 14.804 1.273 17.382
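The callback above replots the embedding during optimization. tsne() returns the final coordinates as an n × 2 matrix, so a minimal sketch of plotting the finished embedding (reusing the colors vector defined earlier) is:

```r
# Plot the final embedding returned by tsne::tsne()
plot(tsne_iris, t = 'n', xlab = "tSNE Dim 1", ylab = "tSNE Dim 2")
text(tsne_iris, labels = iris$Species, col = colors[iris$Species])
```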
Rtsne example
Rtsne provides clearer hyperparameters, better documentation, and more flexibility than the tsne package.
# Rtsne checks for duplicate observations by default, since the distance between
# two identical points is zero and their neighbor probabilities are not meaningful.
# Here we disable that check rather than removing the duplicated iris row.
set.seed(1)
Rtsne_iris <- Rtsne::Rtsne(as.matrix(iris[, -5]),
# Return just the first two dimensions
dims = 2,
# Let's set perplexity to 5% of the number of rows
# Try setting it to a larger value as well, like 25%
perplexity = nrow(iris) * 0.05,
# try changing theta to 0.0 to see what happens
theta = 0.5,
# eta is the learning rate; try setting it to other values and see what happens
eta = 1,
# Tell the algorithm it is okay to have duplicate rows
check_duplicates = F)
# Unpack!
names(Rtsne_iris)
## [1] "theta" "perplexity" "N" "origD" "Y" "costs"
## [7] "itercosts" "stop_lying_iter" "mom_switch_iter" "momentum" "final_momentum" "eta"
## [13] "exaggeration_factor"
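One way to check convergence is via the itercosts element, where Rtsne records the KL divergence at intervals during optimization (every 50 iterations by default); a quick sketch:

```r
# KL divergence recorded at checkpoints during optimization
plot(Rtsne_iris$itercosts, type = "b",
     xlab = "checkpoint (every 50 iterations)", ylab = "KL divergence")
```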
# Plot first two dimensions
plot(Rtsne_iris$Y[, 1:2],col = iris$Species)
# Compare against the first two principal components from PCA
pca_iris = princomp(iris[, 1:4])$scores[, 1:2]
plot(pca_iris, t = 'n')
text(pca_iris, labels = iris$Species, col = colors[iris$Species])
Let’s recapitulate Mark Borg’s walkthrough here, continuing with the Rtsne_iris model from above. We cbind the tSNE coordinates onto our dataset and then fit a random forest to this augmented dataset!
# Add tSNE coordinates via cbind
data = cbind(iris, Rtsne_iris$Y)
# Rename the new columns
colnames(data)[6:7] = c("tSNE_Dim1", "tSNE_Dim2")
# Check out the dataset
head(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species tSNE_Dim1 tSNE_Dim2
## 1 5.1 3.5 1.4 0.2 setosa 9.111711 -3.6912105
## 2 4.9 3.0 1.4 0.2 setosa 14.571546 3.1283754
## 3 4.7 3.2 1.3 0.2 setosa 16.083600 -0.2395849
## 4 4.6 3.1 1.5 0.2 setosa 15.695034 0.6848413
## 5 5.0 3.6 1.4 0.2 setosa 10.379982 -4.4955670
## 6 5.4 3.9 1.7 0.4 setosa 8.042246 -9.6696125
# Split the data
set.seed(1)
split = caret::createDataPartition(data$Species, p = 0.75, list = FALSE)
training_set = data[split,]
test_set = data[-split,]
# Identify species "target" variable and predictors for train and test sets
X_train = training_set[, -5]
Y_train = training_set$Species
X_test = test_set[, -5]
Y_test = test_set$Species
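createDataPartition samples within each class, so the 50/50/50 species balance should carry over to both subsets; a quick sketch to verify:

```r
# The stratified split should preserve class proportions
table(Y_train)
table(Y_test)
```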
Fit the random forest:
set.seed(1)
RF = randomForest(x = X_train, y = Y_train,
                  xtest = X_test, ytest = Y_test,
                  ntree = 500,
                  proximity = TRUE,
                  importance = TRUE,
                  keep.forest = TRUE,
                  do.trace = TRUE)
predicted = predict(RF, X_test)
table(predicted, Y_test)
## Y_test
## predicted setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 12 1
## virginica 0 0 11
mean(predicted == Y_test)
## [1] 0.9722222
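caret can summarize the same predictions with per-class statistics; a sketch using caret::confusionMatrix (both arguments are factors with the same levels):

```r
# Accuracy plus per-class sensitivity and specificity
caret::confusionMatrix(data = predicted, reference = Y_test)
```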
varImpPlot(RF)
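varImpPlot gives the graphical view; to check numerically whether the tSNE coordinates rank among the top predictors, a sketch pulling the underlying scores (available because we fit with importance = TRUE):

```r
# Mean decrease in accuracy for each predictor, sorted
imp = randomForest::importance(RF, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```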